Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OGRLayer::GetArrowStream(): add a DATETIME_AS_STRING=YES/NO option #11213

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

rouault
Copy link
Member

@rouault rouault commented Nov 5, 2024

  • GetArrowStream(): add DATETIME_AS_STRING=YES/NO. Defaults to NO. Added in GDAL 3.11. Whether DateTime fields should be returned as a (normally ISO-8601 formatted) string by drivers. The aim is to be able to handle mixed timezones (or timezone naive values) in the same column. All drivers must honour that option, and potentially fallback to the OGRLayer generic implementation if they cannot (which is the case for the Arrow, Parquet and ADBC drivers).
    When DATETIME_AS_STRING=YES, the TIMEZONE option is ignored.

Fixes geopandas/pyogrio#487

  • OGRLayer::GetArrowStream(): when DATETIME_AS_STRING=YES, expose "GDAL:OGR:type":"DateTime" metadata in the ArrowSchema of DateTime fields

  • CreateFieldFromArrowSchema(): take into account GDAL:OGR:Type=DataTime when ArrowSchema.format='u' (string)

  • ogr2ogr: GPKG/FlatGeoBuf -> other format: in Arrow code path, use DATETIME_AS_STRING to preserve origin timezone

Fixes #11212

@coveralls
Copy link
Collaborator

coveralls commented Nov 6, 2024

Coverage Status

coverage: 73.625% (-0.001%) from 73.626%
when pulling ac2995d on rouault:GetArrowStream_DATETIME_AS_STRING
into c578e9d on OSGeo:master.

DATETIME_AS_STRING=YES/NO. Defaults to NO. Added in GDAL 3.11.
Whether DateTime fields should be returned as a (normally ISO-8601
formatted) string by drivers. The aim is to be able to handle mixed
timezones (or timezone naive values) in the same column.
All drivers must honour that option, and potentially fallback to the
OGRLayer generic implementation if they cannot (which is the case for the
Arrow, Parquet and ADBC drivers).
When DATETIME_AS_STRING=YES, the TIMEZONE option is ignored.

Fixes geopandas/pyogrio#487
@rouault rouault force-pushed the GetArrowStream_DATETIME_AS_STRING branch from ae77c2f to ad41bb8 Compare November 6, 2024 01:26
@rouault
Copy link
Member Author

rouault commented Nov 6, 2024

@theroggy @jorisvandenbossche I'm thinking that in this DATETIME_AS_STRING=YES mode, in the ArrowSchema of datetime fields exposed as string (format='u'), we should probably also set the metadata field with a hint for the DateTime semantics. Any suggestion of an appropriate value for it?

@jorisvandenbossche
Copy link
Contributor

Thanks a lot for looking into this!

we should probably also set the metadata field with a hint for the DateTime semantics. Any suggestion of an appropriate value for it?

Would you just want to indicate that the original GDAL/OGR type was a DateTime? Or is there more information about the column that GDAL can know at that point?
For the type, maybe something like "gdal:type": "DateTime" ? (there is not yet any precedence where you store some information like this is any file format?)

@rouault
Copy link
Member Author

rouault commented Nov 6, 2024

Would you just want to indicate that the original GDAL/OGR type was a DateTime?

actually, I'm just remembering that we have already something. https://gdal.org/en/latest/doxygen/classOGRLayer.html#a3ffa8511632cbb7cff06a908e6668f55 mentions:

Starting with GDAL 3.8, the ArrowSchema::metadata field filled by the get_schema() callback may be set with the potential following items:
    "GDAL:OGR:alternative_name": value of OGRFieldDefn::GetAlternativeNameRef()
    "GDAL:OGR:comment": value of OGRFieldDefn::GetComment()
    "GDAL:OGR:default": value of OGRFieldDefn::GetDefault()
    "GDAL:OGR:subtype": value of OGRFieldDefn::GetSubType()
    "GDAL:OGR:width": value of OGRFieldDefn::GetWidth() (serialized as a string)
    "GDAL:OGR:unique": value of OGRFieldDefn::IsUnique() (serialized as "true" or "false")
    "GDAL:OGR:domain_name": value of OGRFieldDefn::GetDomainName()

Those are only filled when they cannot be expressed with an Arrow concept.
So logically that should be extended with "GDAL:OGR:type": "DateTime" in that situation

…:OGR:type":"DateTime" metadata in the ArrowSchema of DateTime fields
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
funded through GSP Work funded through the GDAL Sponsorship Program
Projects
None yet
3 participants